Dependencies for running the Rmd

Code is mostly excluded from the knit .html version of this notebook to maintain a clean presentation. It is included in a few places that make sense for demonstrative purposes. The full code is provided in the accompanying .Rmd file.

To run the .Rmd file, make sure the included dependencies Elasticsearch.R and elasticsearch_queries.R are in the same directory as the .Rmd, and set “elasticsearch_host” to the appropriate value here (the value is omitted from the GitHub version for security reasons).

elasticsearch_host <- ""
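One way to keep the host out of version control is to read it from an environment variable and to check that the two dependency files are present before knitting. The following is a minimal sketch under those assumptions; the `check_dependencies` helper and the `ELASTICSEARCH_HOST` variable name are hypothetical and not part of the accompanying code.

```r
# Hypothetical helper: returns the names of required files missing from `dir`.
check_dependencies <- function(dir = ".",
                               files = c("Elasticsearch.R",
                                         "elasticsearch_queries.R")) {
  files[!file.exists(file.path(dir, files))]
}

missing_files <- check_dependencies()
if (length(missing_files) > 0) {
  message("Missing dependencies: ", paste(missing_files, collapse = ", "))
}

# Assumption: the host is exported as ELASTICSEARCH_HOST before knitting.
# Sys.getenv() falls back to an empty string when the variable is not set.
elasticsearch_host <- Sys.getenv("ELASTICSEARCH_HOST", unset = "")
```

With this approach the .Rmd can be committed unchanged, and each user sets the variable in their own environment.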

Overview

Discuss the motivation of this Twitter study and what it aims to accomplish. Specifically, what research question is being investigated?

Methodology

The dataset

Provide details on the dataset being used in this study. Include details on how the dataset was collected and provide references/citations if applicable.

If using the Rensselaer IDEA COVID-TweetIDs dataset, it can be referenced here: Rensselaer IDEA COVID-19 Tweet Dataset

If using the dataset from the paper “Extracting COVID-19 Events from Twitter”, it can be referenced here: Extracting COVID-19 Events from Twitter

Analysis methods

Discuss how the data is being used to answer the research question. Provide details on any statistical methods, aggregations, classification, clustering, etc. being used.

Results

Here, run the code to query the dataset from the appropriate Elasticsearch index, execute the analysis, and visualize the results.

Query setup

# query start date/time (inclusive)
rangestart <- "2020-01-01 00:00:00"

# query end date/time (exclusive)
rangeend <- "2020-09-01 00:00:00"

# query semantic similarity phrase
semantic_phrase <- ""

# return results in chronological order or as a random sample within the range
# (ignored if semantic_phrase is not blank)
random_sample <- FALSE

# number of results to return (max 10,000)
resultsize <- 10000
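Before sending the query, it can help to confirm that the parameters are consistent with the comments above: the start precedes the end, and the result size stays within the 10,000 limit. The following is a minimal sketch; the `validate_query` helper is hypothetical, and interpreting the timestamps as UTC is an assumption.

```r
# Hypothetical helper: checks the query parameters before they are used.
validate_query <- function(rangestart, rangeend, resultsize,
                           max_results = 10000) {
  fmt <- "%Y-%m-%d %H:%M:%S"
  # Assumption: timestamps are interpreted as UTC.
  start <- as.POSIXct(rangestart, format = fmt, tz = "UTC")
  end <- as.POSIXct(rangeend, format = fmt, tz = "UTC")
  stopifnot(
    !is.na(start), !is.na(end),   # both timestamps must parse
    start < end,                  # start is inclusive, end is exclusive
    resultsize >= 1, resultsize <= max_results
  )
  invisible(TRUE)
}

validate_query("2020-01-01 00:00:00", "2020-09-01 00:00:00", 10000)
```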

Selection of optimal number of clusters and subclusters

To find the optimal number of high-level theme clusters for this sample, an elbow plot is used:

The plot is mostly a smooth curve, although there is a distinct “elbow” between k=8 and k=10. We will select k=8:

k <- 8
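For readers who want to reproduce an elbow plot of this kind, the following is a minimal sketch using `kmeans` from base R. The random `embeddings` matrix is a stand-in for the tweet embeddings, and the range of k values is an assumption; the full code in the .Rmd may differ.

```r
set.seed(42)

# Stand-in for the tweet embedding matrix (rows = tweets, columns = dimensions).
embeddings <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# Total within-cluster sum of squares for each candidate number of clusters.
k_values <- 2:15
wss <- sapply(k_values, function(k) {
  kmeans(embeddings, centers = k, nstart = 5, iter.max = 50)$tot.withinss
})

# The "elbow" is the point where adding clusters stops reducing wss sharply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```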

To find the optimal number of topic subclusters for each theme cluster, another elbow plot is generated with a separate curve for each theme cluster. Since the within-cluster sums of squares can be on different scales for theme clusters of different sizes and levels of diversity, the withinss metric is scaled to zero mean and unit variance:

Each theme cluster follows a similar, smooth curve. This time there is no clear “elbow” point. A reasonable choice of k can be selected anywhere between 8 and 15. We will select cluster.k=8 for the topic subclusters:

cluster.k <- 8
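The per-cluster scaling described above can be sketched as follows. This is an illustration on random data, assuming the total within-cluster sum of squares is computed for each candidate subcluster count and then standardized per theme cluster with `scale()`; the variable names are hypothetical.

```r
set.seed(42)

# Stand-in embeddings and a small theme clustering (3 themes for brevity).
embeddings <- matrix(rnorm(300 * 5), nrow = 300, ncol = 5)
theme <- kmeans(embeddings, centers = 3, nstart = 5, iter.max = 50)$cluster

# One column per theme cluster, one row per candidate subcluster count.
k_values <- 2:10
wss_by_theme <- sapply(sort(unique(theme)), function(cl) {
  members <- embeddings[theme == cl, , drop = FALSE]
  sapply(k_values, function(k) {
    kmeans(members, centers = k, nstart = 5, iter.max = 50)$tot.withinss
  })
})

# scale() standardizes each column to mean 0 and standard deviation 1,
# which puts curves from clusters of different sizes on a common axis.
scaled_wss <- scale(wss_by_theme)

matplot(k_values, scaled_wss, type = "b", pch = 1,
        xlab = "Number of subclusters (cluster.k)",
        ylab = "Scaled within-cluster sum of squares")
```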

Visualization of theme clusters and topic subclusters

## [1] "Subclustering cluster 1 ..."
## [1] "Subclustering cluster 2 ..."
## [1] "Subclustering cluster 3 ..."
## [1] "Subclustering cluster 4 ..."
## [1] "Subclustering cluster 5 ..."
## [1] "Subclustering cluster 6 ..."
## [1] "Subclustering cluster 7 ..."
## [1] "Subclustering cluster 8 ..."
## [1] "Plotting cluster 1 ..."
## [1] "Plotting cluster 2 ..."
## [1] "Plotting cluster 3 ..."
## [1] "Plotting cluster 4 ..."
## [1] "Plotting cluster 5 ..."
## [1] "Plotting cluster 6 ..."
## [1] "Plotting cluster 7 ..."
## [1] "Plotting cluster 8 ..."

Analysis

Present an analysis of the results obtained by your methods. For example:

Theme (cluster 1): prevent / cure / cures

5 closest tweets to theme cluster center
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 3527 | 0.6989163 | “The problem is that we cannot cure #CoronaVirus, that’s why we need to mitigate it because there’s no vaccine for it” - @mattiaferraresi @robertmarawa #MSW #ReactionMonday | Johannesburg, South Africa |
| 461 | 0.6878533 | There Are No Breakthrough Treatments For Coronavirus, So Don’t Fall For Internet “Cures” #coronavirus #internet #health #healthy #healthyliving #healthylifestyle | Weston, Florida |
| 737 | 0.6773909 | Coronavirus: Though so-called naturopathic influencers on social media claim taking near-lethal doses of vitamin C is the cure for COVID-19, one expert says that vitamin C is unlikely to cure coronavirus. | Glasgow |
| 4599 | 0.6743958 | @melissadderosa Lets get these doctors and nurses the one thing they really need to combat this devastating virus! ENU200 #coronavirus cure! But we need to have support to get that cure to those who need it most and stop this virus in its tracks to save lives #ENU200curescovid19 | Global |
| 3529 | 0.6574262 | @AngelaBelcamino @realDonaldTrump The cure is remaining locked down for however long and waiting for the virus to die out which would in turn destroy the economy, the problem is COVID-19. | Compton, CA |
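A ranking like the one above can be produced by computing the cosine similarity between each tweet embedding and the cluster center, then keeping the five highest values. The following is a minimal sketch on random data; the `cosine_similarity` helper and the stand-in `embeddings` matrix are illustrative and may not match the code in the .Rmd.

```r
# Cosine similarity between each row of `x` and the vector `center`.
cosine_similarity <- function(x, center) {
  as.vector(x %*% center) / (sqrt(rowSums(x^2)) * sqrt(sum(center^2)))
}

set.seed(42)

# Stand-in for the embeddings of the tweets assigned to one cluster.
embeddings <- matrix(rnorm(20 * 4), nrow = 20, ncol = 4)
center <- colMeans(embeddings)

similarity <- cosine_similarity(embeddings, center)

# Row indices of the 5 tweets closest to the center, most similar first.
closest <- order(similarity, decreasing = TRUE)[1:5]
data.frame(row = closest, center_cosine_similarity = similarity[closest])
```

Cosine similarity compares direction rather than magnitude, so two embeddings that point the same way score 1 regardless of their lengths.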

Topic (subcluster 1.1): hands / spread / wash

5 closest tweets to topic subcluster center
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1018 | 0.7961717 | "@ChidiOdinkalu: U can wear nose mask or wash ur hand to prevent against #CoronaVirus but nose mask does not prevent against #CoronaLeaders. The most deadly viruses are elected….! #CoronaVirusUpdate | EARTH, WORLD |
| 727 | 0.7456827 | Washing your hands is the better way to prevent the spread of germs! #COVID2019 #coronavirus #WashYourHands | TRUMP NATION |
| 3199 | 0.7398792 | The World Health Organisation is advising people to follow five simple steps to help prevent the spread of COVID-19: 1. Wash your hands 2. Cough/sneeze into your elbow 3. Don’t touch your face 4. Stay more than 3ft (1m) away from others 5. Stay home if you feel sick | Pakistan |
| 493 | 0.7219483 | If you want to prevent yourself from #Coronavirus, then use #Palmist Hand Sanitizers. #Sanitizers #SanitizationMaximization #coronavirusinindia #CoronaVirusUpdate #CoronaVirusOutbreak #HealthForAll #CoronaAlert | Delhi, India |
| 3573 | 0.7102517 | Everyone can help prevent the spread of COVID-19. One way is by washing your hands frequently with soap and water for at least 20 seconds. #DontBeASpreader #COVID19 #FlattenTheCurve #PinellasEM #PinellasMC | Pinellas County, Florida |

Discussion

Summarize the study and discuss key takeaways.

Next steps for this analysis

Discuss interesting directions for follow-up investigation.

Limitations

Discuss any technical limitations or general assumptions made in the study that the reader should be aware of.

References

[1] …

[2] …

[3] …